The goals of this lab are to become familiar with data visualization
using the R package ggplot2, which is an
implementation of the grammar of graphics approach to data
visualization while also reviewing some of the statistical concepts from
Chapter 2.
As preparation, it’s a good idea to read through Chapters 2 & 3 of the book and also try doing the Questions and Exercises there.
Work through this lab by running all the R code on your computer, chunk by chunk and please make sure that you understand the inputs and the outputs of each operation.
Once you get to the end of this lab, don’t forget to go to Canvas after Friday evening to answer the questions in Quiz 2 before the deadline on Monday.
Please do NOT work with other students to answer the questions on Canvas. You will need a Stanford ID to log in to Canvas. –>
RStudio cheatsheets are great for quick reference guides to popular R packages like ggplot2 and dplyr.
A useful resource to consult if you have questions on the details of
the code is the ggplot2 documentation website ggplot2. You may
want to keep that website open as a reference while working on this
lab.
The code below checks if all packages required for today’s lab are available and installs any missing ones.
pkgs_needed = c("tidyverse", "Biostrings",
"parathyroidSE", "EnsDb.Hsapiens.v86","hexbin","vcd")
letsinstall = setdiff(pkgs_needed, installed.packages())
if (length(letsinstall) > 0) {
BiocManager::install(letsinstall)
}
knitr::opts_chunk$set(tidy = FALSE, cache = TRUE, autodep = TRUE,
dev = "png", dpi = 300,
size = "small",
message = FALSE, error = FALSE, warning = TRUE)
library("dplyr")
library("ggplot2")
library("Biostrings")
library("readr")
First, we will use a data set about births in the USA in 2006. You can read more about this in the “R in a Nutshell, 2nd edition” book which is freely available as a PDF file online. This is convenient since you can follow the same methods in that book (except translating into the grammar of graphics to make plots).
To load the data, download the raw file from here: https://github.com/cran/nutshell/blob/master/data/births2006.smpl.rda Then, save the file locally on your computer. You will need to enter the path to your local file in the code chunk below.
After we load the dataset, we add two new variables: one to encode whether the day falls on a weekend or whether it is a weekday, and a three-level, discretized health score based on the APGAR 5 score.
We also fix an idiosyncratic encoding of not available values by 99
with proper NAs.
Finally, we define a subsampled version of the dataset
births_small to speed up some of the plotting (feel free to
try out the computations and plots we see subsequently with the full
dataset births.)
# Enter the path to your local copy of births2006.smpl.rda below.
# You can download the file at the link above.
# load("INSERT_PATH_TO_YOUR_LOCAL_COPY_OF_births2006.smpl.rda")
load("../data/births2006.RData")
births2006 = births2006.smpl
births = mutate(births2006,
WEEKEND = ifelse(DOB_WK %in% c(1, 7), "Weekend", "Weekday"),
HEALTH = c("CritLow", "Low", "Normal")[ 1+findInterval(APGAR5, c(3.5, 6.5)) ],
ESTGEST = replace(ESTGEST, ESTGEST==99, NA))
set.seed(1)
births_small = births[ sample(nrow(births), 40000), ]
head(births_small)
## DOB_MM DOB_WK MAGER TBO_REC WTGAIN SEX APGAR5 DMEDUC
## 2518571 2 2 24 3 44 F 9 2 years of high school
## 3136811 6 5 21 2 39 M 5 NULL
## 1498513 3 3 29 2 29 F 9 2 years of college
## 596536 9 5 35 4 NA F NA NULL
## 451506 6 4 35 4 NA F NA NULL
## 1447720 7 4 33 4 60 M 9 2 years of high school
## UPREVIS ESTGEST DMETH_REC DPLURAL DBWT WEEKEND HEALTH
## 2518571 10 37 Vaginal 1 Single 2778 Weekday Normal
## 3136811 8 39 Vaginal 1 Single 3160 Weekday Low
## 1498513 16 37 C-section 1 Single 3040 Weekday Normal
## 596536 13 NA C-section 1 Single 3317 Weekday <NA>
## 451506 10 NA Vaginal 1 Single 2778 Weekday <NA>
## 1447720 15 40 Vaginal 1 Single 3969 Weekday Normal
The births2006 dataset was originally distributed
with the nutshell package,
which is no longer available on CRAN.
You can still find details of the data in each column at https://rdrr.io/cran/nutshell/man/births2006.smpl.html
geomsggplot2 is a plotting system for R, based on the grammar
of graphics. It takes care of many of the fiddly details that make
plotting a hassle (like drawing legends) as well as providing a powerful
model of graphics that makes it easy to produce complex multi-layered
graphics. You can find out more at https://ggplot2.tidyverse.org/reference/.
You can think of plots generated with ggplot2 as
sentences. The function ggplot() starts a sentence
(initializes a plot) which usually contains a noun/subject (a
dataset with information to be plotted).
The + operator is used to add further clauses/fragments
containing “verbs”, “adjectives”, “adverbs” etc. to the sentence. This
allows you to construct sophisticated plots from a few elementary
building blocks, just like forming compound sentences from simple
phrases.
In ggplot world, verbs are geometric objects, or
geoms. For example, geom_bar() draws
bars, geom_point() draws points. There can be multiple
geoms in a sentence, and these are then also called
layers.
For each geom, you need to define required
aesthetics using the function aes(). The
aesthetics of a geom determine how the attributes of the
geometric object (e.g. the x- and y- coordinates of the points, their
colors and shapes, etc.) are mapped to the columns of the supplied
data.frame. The aesthetics mapping can be defined when initializing a
plot (with ggplot(data = dataframe_name, aes(...))), which
makes it apply to all geoms. Otherwise, one can specify
aes() mapping for each geom separately
(e.g. geom_bar(aes(...))), in which case it applies to that
particular geom only.
It is also possible to use a different dataset for geom
than the one supplied to ggplot(). To do this you simply
supply a new data.frame as a data argument of
geom
(e.g. geom_point(data = new_dataframe_name, aes(...)))
which would overwrite the dataset used to this geom.
For example, let us look at the number of births on each day of the
week in our sample data. First set up our plot with the data
births.
ppp = ggplot(births_small)
Note that this doesn’t actually plot anything (just like if you wrote
junk = 7, this does not output anything until you run
junk. What happens here when you run ppp?
The number of births per day of the week can be shown easily in a
barplot, so let us use that. To create a geometric object layer for a
barplot we use the function geom_bar(). In
order for it to know what part of the data we want to actually plot, we
need to give an aesthetic. We can do this by declaring that the
aesthetic for the x-axis is the day of the week of the
birth.
The column DOB_WK gives the day of the week that each
birth happened as a numeric value 1 (Sunday) through 7 (Saturday). We
can tell R that this is actually a factor variable by putting the
variable name in the function factor. Putting this all
together, we get geom_bar(aes(x = factor(DOB_WK)). Finally,
to add this layer to the initialized graph, we add the geom
to ppp with the + operator.
ppp + geom_bar(aes(x = factor(DOB_WK)))
Doctors are able to delay birth with anti-contraction medications or labor suppressants called tocolytics. That might be one reason we see fewer births on day 1 of the week (Sunday) and day 7 (Saturday).
We can get further information from this plot if we add more
aesthetics. For example, maybe we can fill each bar with different
colors corresponding to what type of birth it was (“C-section”,
“Vaginal”, or “Unknown”). We can do this by just including another
aes in the geometric object. Start with the same
initialization, but add a geometric object with the x-axis and also fill
color defined.
ppp + geom_bar(aes(x = factor(DOB_WK), fill = DMETH_REC))
When we made that plot, we used the default value for the
position argument of the geom_bar function. We
could have equivalently written the code as follows.
ppp + geom_bar(aes(x = factor(DOB_WK), fill = DMETH_REC), position = "stack")
Another option is to use position = "dodge". Note that
this is an argument to geom_bar and not to
aes.
ppp + geom_bar(aes(x = factor(DOB_WK), fill = DMETH_REC), position = "dodge")
Now we see that about 1/2 as many C-sections were on weekends as there were on a typical weekday, whereas there were about 3/4 as many vaginal births on weekends as there were on weekdays.
Let us continue looking at the birth data on a day to day basis. We might conjecture that older women are less likely to take a tocolytic since they are more likely to have had other complications already. One way we can do this is to “facet” the graph and display day of the week versus women’s age.
First, let us make a histogram of the women’s ages
(MAGER) to get an idea of what the distribution looks
like.
ppp + geom_histogram(aes(x = MAGER), binwidth = 1)
We used the argument binwidth to set the width of the
bins in this histogram (here binwidth = 1 corresponds to 1
year).
Using the grammar of graphics, it is easy to facet this single graph
to make multiple graphs along another dimension of the data. In this
case, we’re interested in breaking this up along the dimension of day of
birth DOB_WK. We will add this facet with the command
facet_grid or facet_wrap. In
facet_grid, the argument we use is a formula with the rows
(of the tabular display) on the left hand side and the columns (of the
tabular display) on the right hand side (RHS).
A formula is created in R with the tilde operator ~. A
dot in the formula is used to indicate there should be no faceting on
this dimension (either row or column). The formula can also be provided
as a string instead of a classical formula object. In
facet_wrap, we have the same sort of argument, but we only
include a RHS of the formula. We’ll use both of them in an example so
you can see the difference.
Now let us look at this faceting on that variable. Again, we will use
the + operator. Here, we also see that we can save the
geometric objects in the plot and just add facets at the end.
ppph = ggplot(births_small) +
geom_histogram(aes(x = MAGER, fill = DMETH_REC), binwidth = 1)
ppph + facet_wrap( ~ DOB_WK)
ppph + facet_grid(DOB_WK ~ SEX)
What is the difference between facet_wrap and
facet_grid?
Here is an interesting perspective of the data (we use
dplyr::filter to exclude the record where the delivery
method is Unknown).
ggplot(dplyr::filter(births_small, !(DMETH_REC %in% "Unknown"))) +
geom_histogram(aes(x = MAGER, fill = factor(TBO_REC)), binwidth = 1) +
facet_grid(WEEKEND ~ DMETH_REC, scale="free_y", drop = TRUE) +
geom_vline(xintercept = seq(15, 45, by=5), alpha=0.2, color="white") +
labs(title = "Births in USA 2006", fill="Birth\nOrder")
Now try to make a plot with a white background. This page might be work reading.
Quiz question 1 : What do themes do in
ggplot2?
It’s often useful to transform your data before plotting, and that’s what statistical transformations do.
Here we look at the histogram of mothers’ ages as before, but we also add a density estimate to it.
ggplot(births_small, aes(x = MAGER)) +
geom_histogram(aes(y = ..density..), binwidth = 1, fill = "grey", col = "black") +
stat_density(col = "red", fill = NA)
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Maybe we want to compare this to a lognormal distribution’s density.
ggplot(births_small, aes(x = MAGER)) +
geom_histogram(aes(y = ..density..), binwidth = 1, fill="grey", col="black") +
stat_density(col="red", fill=NA) +
stat_function(fun = dlnorm, col = "blue",
args = list(mean = mean(log(births_small$MAGER)),
sd = sd(log(births_small$MAGER))))
Does the log normal distrubtion look like a good fit?
It is sometimes difficult to assess goodness of fit visually using a histogram. Quantile-quantile plots (Q-Q Plots) as described in chapter 2 are better suited for this task:
ggplot(births_small, aes(sample = MAGER)) +
stat_qq(distribution = stats::qlnorm,dparams=list(mean = mean(log(births_small$MAGER)),
sd = sd(log(births_small$MAGER))))+
stat_qq_line(distribution = stats::qlnorm,dparams=list(mean = mean(log(births_small$MAGER)),
sd = sd(log(births_small$MAGER))))
It is now easier to see that the model does not the data very well. We will return to Q-Q plots later in the lab.
You can add any function using the stat_function.
In this example, we will plot the birth weight in grams (DBWT) versus weight gain (WTGAIN) by mother (which seems to be in pounds). Unfortunately, we need to be a little more careful about this when there are NAs in the data.
ppp2 = ggplot(dplyr::filter(births_small, !is.na(WTGAIN) & !is.na(DBWT)),
aes(x = WTGAIN, y = DBWT)) +
labs(x = "Weight Gain by Mother", y = "Birth Weight in Grams")
ppp2 + geom_point()
When we plot this data by itself, it is not clear what is going on – there are many points plotted on top of each other and it just looks messy.
One way to transform this data with a stat_bin2d() in
ggplot2 is by binning the points as
below.
ppp2 + stat_bin2d()
See that the color axis reports counts. That means we count the total number of observations to fall within a box, i.e. we are effectively generating a 2D density plot. This plot seems to indicate that the joint distribution of weight gain by mother and birth weight in grams is unimodal.
Instead of just making counts though, we can compute other quantities
(evaluate functions other than count) within each of the 2d bins on
other two columns of births_small. For example, let us look
at the median number of weeks of gestation for each of the bins.
ppp2 + stat_summary_2d(aes(z = ESTGEST), fun = median) +
labs(title = "median number of weeks of gestation")
## Warning: Removed 55 rows containing non-finite values (`stat_summary2d()`).
If we want to do a regression, we can include this automatically. By
default, ggplot2 uses “locally weighted scatterplot
smoothing” (loess) regression for data sets with < 1000 points and
“generalized additive models” (GAM) for >= 1000 points.
# here we force a linear model fit
ppp2 + geom_point() + stat_smooth(method = lm)
By changing the color aesthetics of the modelled line, we can fit separate models to the corresponding subsets of the data and compare them. For example, let us look at a smoothed fit (GAM) of birthweight versus weight gain by mother separated out by baby boys and girls.
ppp2 + geom_point() + stat_smooth(aes(col = SEX))
Finally, let us look at hexagonal bins as a visually more attractive alternative to the rectangular bins used so far:
library("hexbin")
ppp2 + geom_hex(bins=30) + stat_smooth(aes(col = SEX))
Scales control the mapping between data and aesthetics. Going back to the 2d distribution of birthweight versus weight gain by mothers, it is difficult to see what is going on except that there is a dense region and a less dense region. If we take the square root of the number of births per region, then we can see that there is a smooth transition between the high density area and the low density area.
# we repeat again the code above so you don't gave to scroll to look at it.
ppp2 = ggplot(dplyr::filter(births_small, !is.na(WTGAIN) & !is.na(DBWT)),
aes(x = WTGAIN, y = DBWT)) +
labs(x = "Weight Gain by Mother", y = "Birth Weight in Grams")
ppp2 + stat_bin2d() + scale_fill_gradient(trans = "sqrt")
See what happens when you change the trans to
log10.
Sometimes, we might want to change the scale of the vertical axis. According to Wikipedia’s page on “Fetal Viability”, there is a 50% chance of viability at 24 weeks of gestation. We will repeat the plot above with birth weight on a log scale, so we get better separation of the mean weeks of gestation.
ppp2 + stat_summary_2d(aes(z = ESTGEST), fun = mean) +
scale_y_log10(limits = 10^c(2, 4)) +
scale_fill_gradient2(midpoint = 24) +
labs(title="mean number of weeks of gestation")
## Warning: Removed 55 rows containing non-finite values (`stat_summary2d()`).
Now let us look at only the quadruplets in the full dataset (there are 39 such observations). We want to include lots of variables such as number of prenatal visits (UPREVIS), the mother’s age (MAGER), the estimated weeks of gestation (ESTGEST), the delivery method (DMETH_REC), and the mother’s education level (DMEDUC).
ppp3 = ggplot(dplyr::filter(births, DPLURAL == "4 Quadruplet"),
aes(x = UPREVIS, y = MAGER)) +
geom_point(aes(size = ESTGEST, shape = DMETH_REC, col = DMEDUC)) +
stat_smooth(aes(col = DMETH_REC), method = "lm")
ppp3
## Warning: Removed 4 rows containing missing values (`geom_point()`).
ppp3 + scale_size(range=c(3, 6)) + scale_color_brewer(palette = "Set1") +
scale_shape(solid = FALSE)
## Warning: Removed 4 rows containing missing values (`geom_point()`).
Note that, while shapes are discrete, colors and sizes can both be scaled continuously.
Quiz question 2 : What is the maximum number of shapes that you are allowed to use in ggplot2 by default?
Quiz question 3 : Write the name of the function that you could use to make more than the maximum number of default shapes allowed. Hint: this function has “values” as one of the arguments ____(…, values = (…)).
Remember that in the the power calculations section at the end of the last lab, we defined a statistic estimating the deviation of the observed to expected frequencies of nucleotides. The code for generating the statistics assuming it came from the null distribution (i.e. when all 4 nucleotides are evenly likely).
oestat = function(o, e){
sum( (e-o)^2/e )
}
set.seed(1)
B = 10000
# here we pick an arbitrary length / not the same as for Celegans
n = 2847
expected = rep(n/4 ,4)
oenull = replicate(
B, oestat(e=expected, o=rmultinom(1,size = n, prob = rep(1/4,4))))
Now, we can estimate the null distribution of this statistics, by plotting a histogram of the generated values:
ggplot(data.frame(null_stats = oenull)) +
geom_histogram(aes(x = null_stats), bins = 100, boundary=0)
We compare the distribution of this statistic to \(Chi^2_3\) (df = 1 * (4-1) = 3) using a q-q plot:
ggplot(data.frame(stat = oenull), aes(sample = stat)) +
stat_qq(distribution = stats::qchisq, dparams = list(df = 3)) +
stat_qq_line(distribution = stats::qchisq, dparams = list(df = 3))
For discrete variables rootograms can be a useful tool for visualizing goodness of fit.
As an example we will generate data from a zero-inflated Poisson distribution (ZIP). There are packages in R that will do this directly but we can build our own function:
rzip<-function(n,pi,lambda){
u<-rbinom(n,1,prob=(1-pi))
x<-rpois(n,lambda)
y<-u*x
return(y)
}
We can generate a small sample from our new function and take a look at its distrubtion:
set.seed(7)
zip1<-rzip(1000,0.1,3)
ggplot(data.frame(counts=zip1),aes(x=counts))+
geom_histogram(bins=11,boundary=0)
If we saw this data in the wild, we might think it was Poisson. To
investigate how good a fit the Poisson model is, we revisit the
goodfit function from chapter 2 and use it to create a
rootogram.
library("vcd")
gf1<-goodfit(zip1,"poisson")
rootogram(gf1)
By default we have a “hanging” rootogram, where the observed counts are the gray bars which “hang” from the red points which are the expected counts assuming the Poisson model. We can see how well the model fits by inspecting the distance from the bottom of the gray bars to the x-axis. Alternative types of rootograms are “standing” and “deviation” shown below:
rootogram(gf1,type="standing")
rootogram(gf1,type="deviation")
In all three cases we can see that the model has trouble fitting the data, struggling with the zero counts as expected.
Quiz question 4 : What was the estimated value of
lambda for the Poisson fit of the zip1 data? (Hint: you can calculate it
directly or use the goodfit object)
We will use the parathyroidGenesSE package in
R. Load the data and read the experimental information and
the abstract.
library("parathyroidSE")
library("EnsDb.Hsapiens.v86")
data("parathyroidGenesSE", package = "parathyroidSE")
metadata(parathyroidGenesSE)$MIAME
## Experiment data
## Experimenter name: Felix Haglund
## Laboratory: Science for Life Laboratory Stockholm
## Contact information: Mikael Huss
## Title: DPN and Tamoxifen treatments of parathyroid adenoma cells
## URL: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37211
## PMIDs: 23024189
##
## Abstract: A 251 word abstract is available. Use 'abstract' method.
abstract(metadata(parathyroidGenesSE)$MIAME)
## [1] "Primary hyperparathyroidism (PHPT) is most frequently present in postmenopausal women. Although the involvement of estrogen has been suggested, current literature indicates that parathyroid tumors are estrogen receptor (ER) alpha negative. Objective: The aim of the study was to evaluate the expression of ERs and their putative function in parathyroid tumors. Design: A panel of 37 parathyroid tumors was analyzed for expression and promoter methylation of the ESR1 and ESR2 genes as well as expression of the ERalpha and ERbeta1/ERbeta2 proteins. Transcriptome changes in primary cultures of parathyroid adenoma cells after treatment with the selective ERbeta1 agonist diarylpropionitrile (DPN) and 4-hydroxytamoxifen were identified using next-generation RNA sequencing. Results: Immunohistochemistry revealed very low expression of ERalpha, whereas all informative tumors expressed ERbeta1 (n = 35) and ERbeta2 (n = 34). Decreased nuclear staining intensity and mosaic pattern of positive and negative nuclei of ERbeta1 were significantly associated with larger tumor size. Tumor ESR2 levels were significantly higher in female vs. male cases. In cultured cells, significantly increased numbers of genes with modified expression were detected after 48 h, compared to 24-h treatments with DPN or 4-hydroxytamoxifen, including the parathyroid-related genes CASR, VDR, JUN, CALR, and ORAI2. Bioinformatic analysis of transcriptome changes after DPN treatment revealed significant enrichment in gene sets coupled to ER activation, and a highly significant similarity to tumor cells undergoing apoptosis. Conclusions: Parathyroid tumors express ERbeta1 and ERbeta2. Transcriptional changes after ERbeta1 activation and correlation to clinical features point to a role of estrogen signaling in parathyroid function and disease."
Parathyroid adenoma (http://en.wikipedia.org/wiki/Parathyroid_adenoma) is a benign tumor of the parathyroid gland. The abstract tells us that some interesting genes to look at are the following:
Let us put them in a table.
genes = read.csv(textConnection(
"name, group
ESR1, estrogen
ESR2, estrogen
CASR, parathyroid
VDR, parathyroid
JUN, parathyroid
CALR, parathyroid
ORAI2, parathyroid"),
stringsAsFactors = FALSE, strip.white = TRUE)
In the parathyroidGenesSE object, the features are
labeled with Ensembl gene identifiers, so let us use a Bioconductor
package to find the corresponding IDs.
ens = ensembldb::select(EnsDb.Hsapiens.v86,
keys = list(GenenameFilter(genes$name),
TxBiotypeFilter("protein_coding")),
columns = c("GENEID", "GENENAME"))
## Warning: 'GenenameFilter' is deprecated.
## Use 'GeneNameFilter' instead.
## See help("Deprecated")
ens =
dplyr::filter(ens, GENEID %in% rownames(parathyroidGenesSE)) %>%
mutate(group = genes$group[match(GENENAME, genes$name)])
ens
## GENEID GENENAME TXBIOTYPE group
## 1 ENSG00000179218 CALR protein_coding parathyroid
## 2 ENSG00000036828 CASR protein_coding parathyroid
## 3 ENSG00000091831 ESR1 protein_coding estrogen
## 4 ENSG00000140009 ESR2 protein_coding estrogen
## 5 ENSG00000177606 JUN protein_coding parathyroid
## 6 ENSG00000160991 ORAI2 protein_coding parathyroid
## 7 ENSG00000111424 VDR protein_coding parathyroid
Make the table of gene counts, add the patient info:
countData = assay( parathyroidGenesSE )
gene.counts = t(countData[ens$GENEID, ])
colnames(gene.counts) = ens$GENENAME
dat = cbind(data.frame(colData( parathyroidGenesSE)), data.frame(gene.counts))
head(dat)
## run experiment patient treatment time submission study sample
## 1 SRR479052 SRX140503 1 Control 24h SRA051611 SRP012167 SRS308865
## 2 SRR479053 SRX140504 1 Control 48h SRA051611 SRP012167 SRS308866
## 3 SRR479054 SRX140505 1 DPN 24h SRA051611 SRP012167 SRS308867
## 4 SRR479055 SRX140506 1 DPN 48h SRA051611 SRP012167 SRS308868
## 5 SRR479056 SRX140507 1 OHT 24h SRA051611 SRP012167 SRS308869
## 6 SRR479057 SRX140508 1 OHT 48h SRA051611 SRP012167 SRS308870
## CALR CASR ESR1 ESR2 JUN ORAI2 VDR
## 1 5055 14525 2 1 1799 632 1167
## 2 6161 16870 3 3 1153 1424 1653
## 3 3333 7798 1 1 1086 309 688
## 4 6407 14299 2 1 1227 974 1407
## 5 3810 9235 4 1 1258 326 702
## 6 4885 12994 2 3 906 814 1235
Plot one of the estrogen related gene’s counts (ESR1) with plot aesthetics and faceting to separate patient samples, treatments and times.
ggplot(dat, aes(col = patient, x = treatment, y = ESR1)) +
geom_point(size = 3) +
facet_grid( . ~ time)
Try to answer the following questions to check your understanding of the topics covered in this lab.
From the plot of the parathyroid data, answer the following.
Quiz question 5 : How many patient samples are there?
Quiz question 6 : How many time points are there?
Quiz question 7 : There were 3 treatments: “Control”, “DPN”, and “OHT”. How many measurements were taken from patient sample 2 under the DPN treatment?
Make your own plot of VDR versus CASR. (That is CASR, not CALR).
Quiz question 8 : Which patient sample has the highest recorded level of CASR?
Quiz question 9 : Which of the pairs of patient samples seem to be well separated in this plot (i.e., for which two patient samples can you draw a line on the plot that perfectly separates them)?
Quiz question 10 : Which patient sample looks different from the other three when you plot VDR versus ORAI2?
pkgs_needed = c("BSgenome.Hsapiens.UCSC.hg19", "Gviz")
letsinstall = setdiff(pkgs_needed, installed.packages())
if (length(letsinstall) > 0) {
BiocManager::install(letsinstall)
}
GC content is the percentage of the genome (or DNA fragment) that is “G” or “C”. To compute the GC content, we count the occurrences of the “G” and “C” alphabets, and divide by the length of the string in question.
We will be using data from chromosome 8 of the human genome version 19 from the UCSC genome repository.
A genomic window is defined as a CpG island if its GC content>50% and observed-to-expected CG ratio>0.6. CpG islands are said to mark important regions in the genome because over 65% of gene promoter regions can be found in with CpG islands.
We want to look at this for the Human Chromosome 8. We use the
Bioconductor package BSgenome.Hsapiens.UCSC.hg19.
library("BSgenome.Hsapiens.UCSC.hg19")
chr8 = Hsapiens$chr8
cpg_url = url("http://web.stanford.edu/class/bios221/data/model-based-cpg-islands-hg19.txt")
CpGtab = read.table(cpg_url, header=T)
nrow(CpGtab)
## [1] 65699
Then we retain only the start and stop positions for chromosome 8 and convert into an IRanges object:
irCpG = with(dplyr::filter(CpGtab, chr == "chr8"),
IRanges(start = start, end = end))
We create the biological context with the next line. The “I” in IRanges stands for “interval”; the “G” in GRanges for “genomic”.
grCpG = GRanges(ranges = irCpG, seqnames = "chr8", strand = "+")
genome(grCpG) = "hg19"
library("Gviz")
ideo = IdeogramTrack(genome = "hg19", chromosome = "chr8")
plotTracks(
list(GenomeAxisTrack(),
AnnotationTrack(grCpG, name = "CpG"), ideo),
from = 2200000, to = 5800000,
shape = "box", fill = "#006400", stacking = "dense")
Views on the chromosome sequence correspond to the CpG islands, irCpG, and to the regions in between (gaps(irCpG)).
CGIview = Views(unmasked(Hsapiens$chr8), irCpG)
NonCGIview = Views(unmasked(Hsapiens$chr8), gaps(irCpG))
We compute transition counts in CpG islands and non-islands using the data.
seqCGI = as(CGIview, "DNAStringSet")
seqNonCGI = as(NonCGIview, "DNAStringSet")
dinucCpG = sapply(seqCGI, dinucleotideFrequency)
dinucNonCpG = sapply(seqNonCGI, dinucleotideFrequency)
dinucNonCpG[, 1]
## AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
## 389 351 400 436 498 560 112 603 359 336 403 336 330 527 519 485
NonICounts = rowSums(dinucNonCpG)
IslCounts = rowSums(dinucCpG)
For a four state Markov chain as we have, we define the transition matrix as a matrix where the rows are the from state and the columns are the to state.
TI = matrix( IslCounts, ncol = 4, byrow = TRUE)
TnI = matrix(NonICounts, ncol = 4, byrow = TRUE)
dimnames(TI) = dimnames(TnI) =
list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))
We use the counts of numbers of transitions of each type to compute frequencies and put them into two matrices.
MI = TI /rowSums(TI)
MI
## A C G T
## A 0.20457773 0.2652333 0.3897678 0.1404212
## C 0.20128250 0.3442381 0.2371595 0.2173200
## G 0.18657245 0.3145299 0.3450223 0.1538754
## T 0.09802105 0.3352314 0.3598984 0.2068492
MN = TnI / rowSums(TnI)
MN
## A C G T
## A 0.3351380 0.1680007 0.23080886 0.2660524
## C 0.3641054 0.2464366 0.04177094 0.3476871
## G 0.2976696 0.2029017 0.24655406 0.2528746
## T 0.2265813 0.1972407 0.24117528 0.3350027
Quiz question 11 : Which nucleotide transitions are most affected between the Islands and Non-Islands, ie between MI and MN?
Are the relative frequencies of the different nucleotides different in CpG islands compared to elsewhere ?
freqIsl=alphabetFrequency(seqCGI,baseOnly=TRUE,collapse=TRUE)[1:4]
freqIsl / sum(freqIsl)
## A C G T
## 0.1781693 0.3201109 0.3206298 0.1810901
freqNon=alphabetFrequency(seqNonCGI,baseOnly=TRUE,collapse=TRUE)[1:4]
freqNon / sum(freqNon)
## A C G T
## 0.3008292 0.1993832 0.1993737 0.3004139
Quiz question 12 : Which nucleotides are most prevelant in CpG islands?
Quiz question 13 : Which nucleotides are most prevelant in non-islands?
But how can we decide if a new sequence comes from a CpG island? What is the probability it belongs to a CpG island compared to somewhere else? We compute a score based on what is called the odds ratio, i.e $( $P(sequence | island)/P(sequence| nonisland) \()\). See the textbook for details as to how this is evaluated.
alpha = log((freqIsl/sum(freqIsl)) / (freqNon/sum(freqNon)))
beta = log(MI / MN)
scorefun = function(x) {
s = unlist(strsplit(x, ""))
score = alpha[s[1]]
if (length(s) >= 2)
for (j in 2:length(s))
score = score + beta[s[j-1], s[j]]
score
}
x = "ACGTTATACTACG"
scorefun(x)
## A
## -0.2824623
In the code below, we pick sequences of length len = 100 out of the 2855 sequences in the seqCGI object, and then out of the 2854 sequences in the seqNonCGI object (each of them is a DNAStringSet). In the first three lines of the generateRandomScores function, we drop sequences that contain any letters other than A, C, T, G; such as “.” (a character used for undefined nucleotides). Among the remaining sequences, we sample with probabilities proportional to their length minus len and then pick subsequences of length len out of them. The start points of the subsequences are sampled uniformly, with the constraint that the subsequences have to fit in.
generateRandomScores = function(s, len = 100, B = 1000) {
alphFreq = alphabetFrequency(s)
isGoodSeq = rowSums(alphFreq[, 5:ncol(alphFreq)]) == 0
s = s[isGoodSeq]
slen = sapply(s, length)
prob = pmax(slen - len, 0)
prob = prob / sum(prob)
idx = sample(length(s), B, replace = TRUE, prob = prob)
ssmp = s[idx]
start = sapply(ssmp, function(x) sample(length(x) - len, 1))
scores = sapply(seq_len(B), function(i)
scorefun(as.character(ssmp[[i]][start[i]+(1:len)]))
)
scores / len
}
scoresCGI = generateRandomScores(seqCGI)
scoresNonCGI = generateRandomScores(seqNonCGI)
Now, we can use ggplot() to view the distribution of
computed scores for the two distributions:
df = rbind(
data.frame(region = "cgi", score = scoresCGI),
data.frame(region = "ncgi", score = scoresNonCGI)
)
ggplot(df, aes(x = score, fill = region)) +
geom_histogram(alpha = 0.5, color = "black", position = "identity", bins=35)